Done!
Done!
# access the MASS package
library(MASS)
# load the data
data("Boston")
# pass Boston to another object for easy typing
bos <- Bostonlibrary(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.8
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
#explore structure
str(bos)## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#explore dimensions
dim(bos)## [1] 506 14
#generate a codebook
# string is copy from dataset introduction
codebook <- data.frame(variable = "CRIM - per capita crime rate by town/ZN - proportion of residential land zoned for lots over 25,000 sq.ft./INDUS - proportion of non-retail business acres per town./CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)/NOX - nitric oxides concentration (parts per 10 million)/RM - average number of rooms per dwelling/AGE - proportion of owner-occupied units built prior to 1940/DIS - weighted distances to five Boston employment centres/RAD - index of accessibility to radial highways/TAX - full-value property-tax rate per $10,000/PTRATIO - pupil-teacher ratio by town/B - 1000(Bk-0.63)^2 where Bk is the proportion of blacks by town/LSTAT - % lower status of the population/MEDV - Median value of owner-occupied homes in $1000's")
codebook <- codebook %>%
separate_rows(variable, sep = "/") %>% # "/" is the delimiter for rows
separate(variable, sep = " - ", #" - " is the delimiter for variables
into = c("name", "description"), # names of sparated variables
remove = T) #remove old column
#check codebook
library(DT)
codebook %>% datatable The data set has 506 observations of 14 variables. Variable CHAS is a dummy variable where 1 means the the place having tract that bounds river, and 0 means otherwise. It needs to be converted to a factor.
bos <- bos %>%
mutate(chas = chas %>%
factor() %>%
fct_recode(
"With tracts that bonds river" = "1", #old value 1 to new label
"Otherwise" = "0") # old value 0 to new label
)Each of the 506 rows in the dataset describes a Boston suburb or town, and it has 14 columns with information such as average number of rooms per dwelling, pupil-teacher ratio, and per capita crime rate. The last row describes the median price of owner-occupied homes.
library(GGally)## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
#define a function that allows me to fine-tune the matrix
my.fun <- function(data, mapping, method = "lm",...){ #define arguments
p <- ggplot(data = data, mapping = mapping) + #pass arguments
geom_point(size = 0.3,
color = "blue",...) + #define points size and color
geom_smooth(size = 0.5,
color = "red",
method = method) #define line size and color; define lm regression
p #print the results
}
#the abbreviated variable names are not self-explanatory, set column and row
#names to be the variable labels for better reading
#this new object will be used in ggpairs function
names1 <- pull(codebook[1:7,], description) # extract row 1:7 of var description
names1 <- sapply(names1, #collapse the description into multiple lines
function(x) paste(strwrap(x, 35), # for better reading
collapse = "\n")) # "\n" calls for a new line
ggpairs(bos,
lower = list(
continuous = my.fun,
combo = wrap("facethist", bins = 20)),
col = 1:7,
columnLabels = names1) #define column labels as the names I just set## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Visualized relations of Boston dataset, variable #1~#7
Note that variable about crime rate is plagued with outliers.
#repeat what is done in the last chunk for variable 8~14
names2 <- pull(codebook[8:14,], description)
names2 <- sapply(names1, function(x) paste(strwrap(x, 35), collapse = "\n"))
ggpairs(bos,
lower = list(
continuous = my.fun),
col = 8:14,
columnLabels = names2,
)## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Visualized relations of Boston dataset, variable #8~#14
library(finalfit)
#summarize the continuous data
ff_glimpse(bos)$Continuous %>% datatable# summarize the categorical data
ff_glimpse(bos)$Categorical %>% datatableThere are 13 continuous variables in the dataset. The crime rate of the town was 0.3(0.1~3.7)%; the proportion of a town’s residential land zoned for lots over 25,000 sq.ft. was 0 (0~12.5)%; the proportion of non-retail business acres per town was 9.7(5.2~18.1)%; the nitric oxides concentration was 0.5(0.4~0.6) parts per 10 million; the average number of rooms per dwelling was 6.3±0.7 rooms; the proportion of owner-occupied units built prior to 1940 was 77.5(45.0~94.1)%; the weighted distances to five Boston employment centres was 3.2 (2.1~5.2) kilometers; the index of accessibility to radial highways was 5(4~24) units of accessibility; the full-value property-tax rate was $330(279~666) per $10,000; the pupil-teacher ratio by town was 19.1(17.4~20.2); the Black proportion of population after taking the formula of 1000(Bk-0.63)^2 was 391.4(375.4~396.2); the proportion of population that is lower status was 11.4(6.9~17.0)%; the median value of owner-occupied homes was $21.2(17~25)*1000. #### 3.3.2 interpreting categorical variable
35(6.9%) towns have tracts that bonds Charles River.
Except for the one binary variable about tract that bonds river, each variable in our data set shows a >0.3 and/or <-0.3 correlation with at least one of the other variables. Some of them have correlation as high as 0.9. All of the correlation coefficients are significant (p<0.001).
library(MASS)
#binary variables with values as number will not influence the result of
#standardization and clustering, hence I will reload Boston without re-coding
#binary variable. This is for easiness of matrix multiplication in the following
#operations
bos <- Boston
bos.s <- as.data.frame(scale(bos))# bos.s means Boston Scaledff_glimpse(bos.s)$Con %>% datatableAll the variables after scaling had a mean of 0 and most of
variables’ values ranged from -4 and 4, only except for variables crim
(crime rate), which might be due to out-liers (corresponds to the
finding from the correlation matrix).
## 4.3 Use the quantiles as the break points in the categorical variable
and drop the old crime rate variable from the dataset.
#generate cutoff according to quantile
bins <- quantile(bos.s$crim)
#generate a categorical variable "crime" and re-code it
bos.s <- bos.s %>%
mutate(crime = crim %>%
cut(breaks = bins, include.lowest = TRUE) %>%
fct_recode("Low" = "[-0.419,-0.411]",
"MediumLow" = "(-0.411,-0.39]",
"MediumHigh" = "(-0.39,0.00739]",
"High" = "(0.00739,9.92]"))
#remove crim
bos.s <- bos.s %>% select(-crim)set.seed(2022)
#generate an object containing the number of observations in bos dataset
n <- nrow(bos.s)
#generate an object "ind", which contains a random selected set of the indexing
#of bos dataset, and the number of indexing takes up 80% of number of observations
ind <- sample(1:n, size = n*0.8)
#generate train&test sets according to the random set of indexing number
train <- bos.s[ind,]
test <- bos.s[-ind,]# fit an linear discriminant model on the train set, named as "lda.fit"
lda.fit <- lda(crime ~ ., data = train) # the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(factor(train$crime))#plot the lda results
plot(lda.fit, dimen = 2, pch = classes, col = classes)+
lda.arrows(lda.fit, myscale = 4)Fig 5.2 Biplot for LDA for clustering crime rate
## integer(0)
Biplot based on LD1 and LD2 was generated, see fig 5.2. The most of four clusters separated poorly, except for the cluster “High”. Heavy overlap was observed between each pair of other cluster. Besides, Clusters High and MediumHigh also showed notable overlaps.
Based on arrows, varaibles lstat explained the most for cluster High. Contributions of variables to other clusters are not clear enough due to the heavy overlap.
#save crime into an object
classes.test <- test$crime
#remove crime
test$crime <- NULLpredicted.test <- predict(lda.fit, test)#generate a table that evaluate the accuracy of model, and pass the table into
#an object named "accuracy.tab"
accuracy.tab <- table(correct = classes.test, predicted = predicted.test$class )
#show the accuracy table
accuracy.tab## predicted
## correct Low MediumLow MediumHigh High
## Low 4 7 0 0
## MediumLow 10 16 4 0
## MediumHigh 2 11 17 1
## High 0 0 0 30
#ask R to identify the correct predictions and add them up
correct.n = 0 # the number of correct predictions starting at 0
for (i in 1:4){ #4 loops because we have 4 rows/columns
correct.c <- accuracy.tab[
which(rownames(accuracy.tab) == colnames(accuracy.tab)[i]),
i] # if a cell has same row and column names, pass its value into "correct.c"
correct.n = correct.c+ correct.n # update the value of correct prediction
} # by adding "correct.c"
# calculate the percent of correct predictions for test set
correct.n/(nrow(bos.s)*0.2) #denominator is the number of obs. in test set## [1] 0.6620553
#reload Boston
data("Boston")
bos <- Boston
#standardize the dataset
bos.s <- as.data.frame(scale(bos))dis_eu <- dist(bos.s)
summary(dis_eu)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
bos.s.km <- kmeans(bos.s, centers = 4) date()## [1] "Tue Nov 22 18:30:37 2022"
set.seed(22) #22 is the date I carried out the analysis
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(bos.s, k)$tot.withinss})# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')+
geom_line(aes(x = 3, color = "red")) +
annotate("text",
label =
"←Elbow effect happens here",
x = 4.8, y =3800, color = "red") Fig 7.4 Elbow plot for trends of within-cluster sum-of-square with increasing number of k
There is a huge reduction in variation with K =3, but after that the variation does not go down as quickly. I will use K = 3 and do the k-means clustering again.
km <- kmeans(bos.s, centers = 3)#define a function that allows me to fine-tune the matrix
my.fun.km <- function(data, mapping,...){ #define arguments
p <- ggplot(data = data, mapping = mapping) + #pass arguments
geom_point(size = 0.3,
color = factor(km$cluster),
...) + #define points size and color
stat_ellipse(geom = "polygon", mapping = mapping, alpha = 0.5)
#calculate an ellipse layer that separate clusters
p #print the results
}
ggpairs(bos, mapping = aes(fill=factor(km$cluster)),
lower = list(
continuous = my.fun.km
),
col = 1:7)Fig 7.6.1 Correlation Matrix with clusters
ggpairs(bos, mapping = aes(fill=factor(km$cluster)),
lower = list(
continuous = my.fun.km
),
col = 8:14)Fig 7.6.2 Correlation Matrix with clusters
By observing the elbow plot that depicts the how size of within-cluster sum-of-square changes with number of k, an optimal number of k = 3 was determined. The subsequent results of k-means clustering was visualized in a correlation matrix with containing variables in dataset. It is observed that some of the variables have contributed tremendously to the clustering. For example, the variable black separates 1 cluster with the other 2 clusters nicely, see the 5th column of the picture above (Fig 7.6.2), x axis; and the variable age separates a different cluster with the other 2 clusters roughly, see the 7th row of the picture above (Fig 7.6.1), y axis. Some pairs of the variables have also played important role in clustering, for example, the combination of black and dist variables separate the 3 clusters roughly, see picture above (Fig 7.6.2, column 1nd, row 5th) or see the picture below (Fig 7.6.3). Due to the limitation of presenting more dimensions in a 2-dimension screen, I am not able to dig into the clustering effect of more variables combined. Fortunately, k-means clustering has done that for me, mathematically.
bos %>% ggplot(aes(x = dis, y = black, color = factor(km$cluster))) +
geom_point() +
geom_abline(intercept = 480, slope = -25)+
geom_abline(intercept = 400, slope = -80) +
stat_ellipse(geom = "polygon",
aes(fill = km$cluster),
alpha = 0.25)the clustering effect of variables black and dis combined
# k = 3 is the optimal clusters found
km <- kmeans(scale(Boston), centers = 3)#reload and standardize data
bos.s <- as.data.frame(scale(Boston))
#save the clusters identified by k-means clustering as a column in the data set
bos.s$km.cluster <- km$cluster
lda.km <- lda(km.cluster ~ ., data = bos.s) # target classes as numeric
classes <- as.numeric(factor(bos.s$km.cluster))
#plot the lda results as biplot
plot(lda.km, dimen = 2, pch = classes, col = classes)+
lda.arrows(lda.km, myscale = 4)Fig. 8.3 Biplot of the LDA for separating clusters identified by K means distance
## integer(0)
Biplot based on LD1 and LD2 was generated, see fig 8.3. The three clusters separated very clearly and some overlap observed between cluster 1 and cluster 3, and between cluster 2 and cluster 3. Cluster 1 and cluster 2 are perfectly separated.
Based on arrows, variables rm, dis and crim explained more for cluster 1; variables indus, rad, tax and nox explained more for cluster 2; and variables black, chas and ptratio explained more for clusters 3. Other variables’ role in clustering was much weaker.
#reload the data
bos.s <- as.data.frame(scale(Boston))
#generate cutoff according to quantile
bins <- quantile(bos.s$crim)
#generate a categorical variable "crime" and re-code it
bos.s <- bos.s %>%
mutate(crime = crim %>%
cut(breaks = bins, include.lowest = TRUE) %>%
fct_recode("Low" = "[-0.419,-0.411]",
"MediumLow" = "(-0.411,-0.39]",
"MediumHigh" = "(-0.39,0.00739]",
"High" = "(0.00739,9.92]"))
#remove crim
bos.s <- bos.s %>% select(-crim)
set.seed(2022)
#generate an object containing the number of observations
n <- nrow(bos.s)
#generate a random set of indexing number with n = 80% of the obs.
ind <- sample(1:n, size = n*0.8)
#generate the train set and test
train <- bos.s[ind,]
test <- bos.s[-ind,]#select predictors for train set, with outcome variable removed
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)## [1] 404 13
dim(lda.fit$scaling)## [1] 13 3
# matrix multiplication, saving the resulting matrix into matrix_product
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
#turn matrix into data frame
matrix_product <- as.data.frame(matrix_product)
#plot the 3D plot with LD1, LD2 and LD3
library(plotly)##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p1 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color = train$crime, #Set the color to be the crime classes of the train set.
size = 2)
p1#select predictors for train set, with outcome variable removed
model_predictors <- dplyr::select(train, -crime)
#get the clusters of k-means for the train set
train.km <- kmeans(model_predictors, centers = 3)
p2 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color = factor(train.km$cluster), #color defined by clusters of the k-means
size = 1.5)
p2The LDA was trained according to a mathematical category of crime rates (quantiles), which has 4 categories. While k = 3 was adopted for the k-means clustering base on the size of within-cluster sum of square. Since LDA is a supervised technique, we know what are each categories represent, which are also labeled in the caption. K-means clustering is a unsupervised method and thus I do not know anything about the real-world representation of the 3 clusters identified before observing closely.
However, by observing the pictures together, it is interesting to find out that, cluster 3 from k-means nicely overlaps with High category from LDA. Also, cluster 2 from k-means roughly overlaps with Low and Medium low categories from LDA. As such, I will re-code categories from LDA according to this finding and see closely on how well results from k-means and LDA are consistent.
train.crime3 <- train %>%
mutate(crime3 = crime %>%
fct_recode("Medium" = "MediumHigh" ,
"Low" = "MediumLow",
"High" = "High",
"Low" = "Low"))
km.cluster <- factor(train.km$cluster)
levels(km.cluster) <- c("Medium","Low","High")accuracy.tab <- table(correct = train.crime3$crime3, kmean.pred = km.cluster)
accuracy.tab ## kmean.pred
## correct Medium Low High
## Low 10 187 15
## Medium 10 48 37
## High 5 0 92
# looping through the 3X3 matrix to add up the columns and rows with same name
correct.n = 0
for (i in 1:3){
correct.c <- accuracy.tab[which(rownames(accuracy.tab) == colnames(accuracy.tab)[i]), i]
correct.n = correct.c+ correct.n
}
# calculate the accuracy rate
correct.n/(nrow(bos.s)*0.8)## [1] 0.7139328
It gets an accuracy rate of 71.3%, greatly outperforming the original LDA model, indicating k-mean cluster could be used as a cue for categorizing continuous variables.
6.4 Comment on the results
Overall, 66.2% of the predictions are correct, showing not quite satisfactory predicting effect of our linear discriminant analysis. Observe the result closely, it is found that the for high and medium high crime rate regions, the analysis did the best predictions, with 90% (47/52) of accuracy. For Low and medium low regions, the predictive effect of our analysis decreased tremendously. This might be the result of a. the violation of the assumption of multivariate normality (but evidence showed even when this is violated, LDA also exhibited good accuracy); b. large number of out-liers in the dependent variable before re-coding (LDA is sensitive to out-liers); c. The small size of category Low in test set; d. better categorization strategy for dependent variable needed (the current categorization is only base on quantiles, which is lack of more evidence-based foundation).